Classification
Yi-Shin Chen
Institute of Information Systems and Applications
Department of Computer Science
National Tsing Hua University
yishin@gmail.com
Many slides in this presentation are adapted from those provided by Tan, Steinbach, and Kumar for the book “Introduction to Data Mining”
Classification: Definition
Given a collection of records (training set)
Each record contains a set of attributes
One of the attributes is the class
Find a model for the class attribute:
The model is a function of the values of the other attributes
Goal: previously unseen records should be assigned a
class as accurately as possible.
A test set is needed
To determine the accuracy of the model
Data Mining @ Yi-Shin Chen 2
Supervised vs. Unsupervised Learning
Supervised learning
Supervision: The training data with class labels
New data is classified based on the training set
Unsupervised learning
The class labels of the training data are unknown
Given a set of measurements with the aim of establishing the existence
of clusters in the data
Illustrating Classification Task
[Diagram: a learning algorithm applies induction to the training dataset to learn a model; the model is then applied to the testing dataset (deduction)]
Examples of Classification Task
Predicting tumor cells as benign or malignant
Classifying credit card transactions as legitimate or fraudulent
Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
Categorizing news stories
Classification Techniques
Decision Tree based Methods
Rule-based Methods
Memory-based reasoning
Neural Networks
Naïve Bayes and Bayesian Belief Networks
Support Vector Machines
Decision Tree
Examples of Decision Tree
Training Data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree

Refund?
├─ Yes → NO
└─ No → Marital Status?
   ├─ Married → NO
   └─ Single, Divorced → Taxable Income?
      ├─ < 80K → NO
      └─ > 80K → YES
Issues Regarding Classification:
Data Preparation
Data cleaning
Preprocess data in order to reduce noise and handle missing values
Relevance analysis (feature selection)
Remove the irrelevant or redundant attributes
Data transformation
Generalize and/or normalize data
Issues Regarding Classification: Evaluating
Classification Methods
Predictive accuracy
Speed and scalability
Robustness
Handling noise and missing values
Scalability
Interpretability:
Understanding and insight provided by the model
Goodness of rules
Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm)
Tree is constructed in a top-down recursive divide-and-conquer manner
At start, all the training examples are at the root
Attributes are categorical
If continuous-valued, they are discretized in advance
Examples are partitioned recursively based on selected attributes
Test attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain)
Algorithm for Decision Tree Induction (Cont'd)
Conditions for stopping partitioning
All samples for a given node belong to the same class
There are no samples left
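The greedy procedure above can be sketched in code. This is a minimal, illustrative ID3-style induction for categorical attributes; the function names and the use of information gain as the heuristic are assumptions for the sketch, not the lecture's exact algorithm:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of the class distribution at a node."""
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())

def build_tree(records, labels, attrs):
    # Stopping condition 1: all samples at the node belong to the same class
    if len(set(labels)) == 1:
        return labels[0]
    # Stopping condition 2: no attributes left -> label with the majority class
    if not attrs:
        return Counter(labels).most_common(1)[0][0]

    def info_gain(a):
        # Entropy reduction from partitioning the records on attribute a
        remainder = 0.0
        for v in set(r[a] for r in records):
            subset = [l for r, l in zip(records, labels) if r[a] == v]
            remainder += len(subset) / len(labels) * entropy(subset)
        return entropy(labels) - remainder

    best = max(attrs, key=info_gain)          # greedy attribute choice
    rest = [a for a in attrs if a != best]
    branches = {}
    for v in set(r[best] for r in records):   # recursive divide-and-conquer
        sub_records = [r for r in records if r[best] == v]
        sub_labels = [l for r, l in zip(records, labels) if r[best] == v]
        branches[v] = build_tree(sub_records, sub_labels, rest)
    return (best, branches)
```

Note this sketch handles categorical attributes only; continuous ones (such as Taxable Income) would be discretized in advance, as the slide notes.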
Training Dataset
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
Partition Example
Splitting the training data on Age:

Age <= 30:
income  student  credit_rating  class
high    no       fair           no
high    no       excellent      no
medium  no       fair           no
low     yes      fair           yes
medium  yes      excellent      yes

Age > 40:
income  student  credit_rating  class
medium  no       fair           yes
low     yes      fair           yes
low     yes      excellent      no
medium  yes      fair           yes
medium  no       excellent      no

Age 31..40:
income  student  credit_rating  class
high    no       fair           yes
low     yes      excellent      yes
medium  no       excellent      yes
high    yes      fair           yes
Tree Induction
Greedy strategy
Split the records based on an attribute test that optimizes a certain criterion
Issues
Determine how to split the records
How to specify the attribute test condition?
How to determine the best split?
Determine when to stop splitting
©Tan, Steinbach, Kumar Introduction to Data Mining
How to Determine the Best Split
Greedy approach:
Nodes with homogeneous class distribution are preferred
Need a measure of node impurity
Entropy
Gini Index
Misclassification error
Non-homogeneous: high degree of impurity
Homogeneous: low degree of impurity
Entropy
Entropy measures the amount of randomness, surprise, or uncertainty
Entropy is defined as:
H(p_1, …, p_n) = − Σ_{i=1}^{n} p_i · log₂ p_i,   where Σ_{i=1}^{n} p_i = 1

[Plot: Entropy(p, 1 − p) as a function of p, rising from 0 at p = 0 to a maximum of 1 at p = 0.5 and falling back to 0 at p = 1]
Example of Entropy
Entropy is a measure of uncertainty in a probability distribution.

[Bar charts of two distributions:
P(event 1) = 0.1, P(event 2) = 0.9 → Entropy = 0.469
P(event 1) = 0.5, P(event 2) = 0.5 → Entropy = 1.0]
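The two cases above can be checked with a few lines of Python (a minimal sketch; log base 2, consistent with Entropy(0.5, 0.5) = 1):

```python
import math

def entropy(probs):
    """Shannon entropy (base 2) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))            # 1.0
print(round(entropy([0.1, 0.9]), 3))  # 0.469
```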
Attribute Selection Measure: Information Gain
Select the attribute with the highest information gain:

Gain = I(parent) − Σ_{j=1}^{k} (N(v_j)/N) · I(v_j)

where the split creates k partitions, N(v_j) is the number of records in partition v_j, and N is the number of records at the parent node.
[The same split on Age as in the Partition Example]

Age <= 30 partition: Entropy = I(2,3) = 0.971
Age 31..40 partition: Entropy = I(4,0) = 0
Age > 40 partition: Entropy = I(3,2) = 0.971
Attribute Selection Measure: Information Gain (Cont'd)

Splitting on age:

E(age) = (5/14)·I(2,3) + (4/14)·I(4,0) + (5/14)·I(3,2) = 0.694

Gain(age) = I(9,5) − E(age) = 0.940 − 0.694 = 0.246
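The Gain(age) computation can be verified directly (a sketch; the counts are the (yes, no) class counts from the partition tables):

```python
import math

def entropy(counts):
    """Entropy of a node given its class counts, e.g. I(9,5)."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

parent = entropy([9, 5])                  # I(9,5) ≈ 0.940
# Partitions by age: <=30 -> (2 yes, 3 no); 31..40 -> (4, 0); >40 -> (3, 2)
splits = [[2, 3], [4, 0], [3, 2]]
e_age = sum(sum(c) / 14 * entropy(c) for c in splits)   # ≈ 0.694
gain_age = parent - e_age                               # ≈ 0.246
```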
Selection Attribute
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
Gain(age) = 0.246

Age has the highest information gain and is therefore selected as the first splitting attribute.

[Training data: the same 14-record buys_computer table shown on the Training Dataset slide]
Comparison among Splitting Criteria
For a 2-class problem:

[Plot comparing the three impurity measures (entropy, Gini index, misclassification error) as a function of the fraction p of records in one class]
Decision Tree Based Classification
Advantages:
Inexpensive to construct
Extremely fast at classifying unknown records
Easy to interpret for small-sized trees
Accuracy is comparable to other classification techniques for
many simple data sets
Practical Issues of Classification
Underfitting and Overfitting
Missing Values
Costs of Classification
Underfitting and Overfitting
Underfitting: when the model is too simple, both training and test errors are large
Overfitting due to Noise
Decision boundary is distorted by a noise point
Overfitting due to Insufficient Examples
Avoid Overfitting in Classification
Two approaches to avoid overfitting
Prepruning
Halt tree construction early
Difficult to choose an appropriate threshold
Postpruning
Remove branches from a “fully grown” tree
Use a set of data different from the training data to decide which is
the “best pruned tree”
Metrics for Performance Evaluation
Focus on the predictive capability of a model
Rather than how fast it classifies or builds models, scalability, etc.
Confusion Matrix:

                         PREDICTED CLASS
                         Class=Yes   Class=No
ACTUAL CLASS  Class=Yes  a (TP)      b (FN)
              Class=No   c (FP)      d (TN)

a: TP (true positive)   b: FN (false negative)
c: FP (false positive)  d: TN (true negative)
Metrics for Performance Evaluation
Most widely-used metric:

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
Limitation of Accuracy
Consider a 2-class problem
Number of Class 0 examples = 9990
Number of Class 1 examples = 10
If model predicts everything to be class 0, accuracy is 9990/10000
= 99.9 %
Accuracy is misleading because model does not detect any class
1 example
Cost-Sensitive Measures
Precision is biased towards C(Yes|Yes) & C(Yes|No)
Recall is biased towards C(Yes|Yes) & C(No|Yes)
F-measure is biased towards all except C(No|No)
Precision (p) = a / (a + c)
Recall (r) = a / (a + b)
F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)
Weighted Accuracy = (w1·a + w4·d) / (w1·a + w2·b + w3·c + w4·d)
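These formulas translate directly into code; a small sketch with made-up confusion-matrix counts (the function name is illustrative):

```python
def metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall, and F-measure from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * recall * precision / (recall + precision)
    return accuracy, precision, recall, f_measure

# Hypothetical counts: a = 40 (TP), b = 10 (FN), c = 5 (FP), d = 45 (TN)
acc, p, r, f = metrics(40, 10, 5, 45)
print(acc, r)   # 0.85 0.8
```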
Test of Significance
Given two models:
Model M1: accuracy = 85%, tested on 30 instances
Model M2: accuracy = 75%, tested on 5000 instances
Can we say M1 is better than M2?
How much confidence can we place on accuracy of M1 and M2?
Can the difference in performance measure be explained as a result of
random fluctuations in the test set?
Confidence Interval for Accuracy
Prediction can be regarded as a Bernoulli trial
A Bernoulli trial has 2 possible outcomes
Possible outcomes for prediction: correct or wrong
Collection of Bernoulli trials has a Binomial distribution:
x ~ Bin(N, p), where x is the number of correct predictions
e.g.: toss a fair coin 50 times; how many heads would turn up?
Expected number of heads = N × p = 50 × 0.5 = 25
Given x (# of correct predictions) or, equivalently, accuracy acc = x/N, and N (# of test instances)
Can we predict p (true accuracy of model)?
Confidence Interval for Accuracy
For large test sets (N > 30), acc has a normal distribution with mean p and variance p(1 − p)/N:

P( −Z_{α/2} ≤ (acc − p) / sqrt(p(1 − p)/N) ≤ Z_{1−α/2} ) = 1 − α

[Standard normal curve: the area between Z_{α/2} and Z_{1−α/2} is 1 − α]

Confidence interval for p:

p = ( 2N·acc + Z_{α/2}² ± Z_{α/2}·sqrt( Z_{α/2}² + 4N·acc − 4N·acc² ) ) / ( 2(N + Z_{α/2}²) )
Example: Confidence Interval for Accuracy
Consider a model that produces an accuracy of 80% when
evaluated on 100 test instances:
N=100, acc = 0.8
Using the confidence-interval formula for p with Z_{α/2} taken from:

1 − α    Z_{α/2}
0.99     2.58
0.98     2.33
0.95     1.96
0.90     1.65

At 95% confidence, the interval for p at various test-set sizes (acc = 0.8):

N         50      100     500     1000    5000
p(lower)  0.670   0.711   0.763   0.774   0.789
p(upper)  0.888   0.866   0.833   0.824   0.811
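The table row for N = 100 can be reproduced with the closed-form bounds above (a sketch; `acc_confidence_interval` is an illustrative name):

```python
import math

def acc_confidence_interval(acc, n, z=1.96):
    """Bounds on the true accuracy p given observed accuracy acc on n instances."""
    z2 = z * z
    center = 2 * n * acc + z2
    spread = z * math.sqrt(z2 + 4 * n * acc - 4 * n * acc * acc)
    denom = 2 * (n + z2)
    return (center - spread) / denom, (center + spread) / denom

lo, hi = acc_confidence_interval(0.8, 100)   # ≈ (0.711, 0.867)
```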
Comparing Performance of 2 Models
Given two models, say M1 and M2, which is better?
M1 is tested on D1 (size=n1), found error rate = e1
M2 is tested on D2 (size=n2), found error rate = e2
Assume D1 and D2 are independent
The difference in the error rate is: d = e1 – e2
The variance of d is:

σ̂_d² = e1(1 − e1)/n1 + e2(1 − e2)/n2

The confidence interval for the true difference is:

d_t = d ± Z_{α/2} · σ̂_d
Example :Comparing Performance of 2 Models
Given: M1: n1 = 30, e1 = 0.15
M2: n2 = 5000, e2 = 0.25
d = |e2 – e1| = 0.1 (2-sided test)
At 95% confidence level, Z_{α/2} = 1.96

σ̂_d² = 0.15(1 − 0.15)/30 + 0.25(1 − 0.25)/5000 = 0.0043

d_t = 0.100 ± 1.96·sqrt(0.0043) = 0.100 ± 0.128

Since the interval contains 0, the difference may not be statistically significant.
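The same computation in code (a sketch; the function name is illustrative):

```python
import math

def diff_confidence_interval(e1, n1, e2, n2, z=1.96):
    """Interval for the true difference between two models' error rates."""
    var = e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2
    d = abs(e2 - e1)
    half = z * math.sqrt(var)
    return d - half, d + half

lo, hi = diff_confidence_interval(0.15, 30, 0.25, 5000)
print(lo < 0 < hi)   # True: the interval contains 0
```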
The Problem Of Decision Tree
[Scatter plots: a dataset whose true class boundary is a diagonal line; a decision tree can only approximate it with a staircase of axis-parallel splits]
Advantages/Disadvantages of Decision Trees
Advantages:
Easy to understand
Easy to generate rules
Disadvantages:
May suffer from overfitting.
Classifies by rectangular partitioning (so does not handle
correlated features very well).
Can be quite large – pruning is necessary.
Does not handle streaming data easily
Random Forest
Bagging/Bootstrap Aggregating
To reduce the variance of an estimated prediction
function
Basic idea:
A committee of trees each casts a vote for the predicted class
Randomly draw datasets with replacement from the training data
[Diagram: the training data is sampled at random with replacement into k datasets; models 1..k are trained on them, and their outputs are aggregated at testing time]
Random Forest Classifier
An extension to bagging which uses de-correlated trees
Create bootstrap samples from the training data
For each sample, construct a decision tree
At each node, choose the split feature from only m < M randomly selected features
Take the majority vote
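The two ingredients, bootstrap sampling and majority voting, are simple to sketch (illustrative helper names; a full implementation would also train one tree per sample and restrict each split to m of the M features):

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) records with replacement: one bagging round."""
    return [rng.choice(data) for _ in data]

def majority_vote(votes):
    """Aggregate the committee's predictions for one test record."""
    return Counter(votes).most_common(1)[0][0]

rng = random.Random(0)
sample = bootstrap_sample(list(range(10)), rng)   # some records repeat, some are absent
print(majority_vote(["yes", "no", "yes"]))        # yes
```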
Naïve Bayes Classifier
Thomas Bayes
1702 - 1761
Example of Histogram
[Histogram: Antenna Length (1-10) for Katydids vs. Grasshoppers]
With a lot of data, we can build a histogram.
Let us just build one for “Antenna Length” for now…
Example of Histogram (Contd.)
Suppose a new insect has antennae of length 3. From the histograms, 10 grasshoppers and 2 katydids have that antenna length:

P(Grasshopper | 3) = 10 / (10 + 2) = 0.833
P(Katydid | 3) = 2 / (10 + 2) = 0.167
P(c_j | d) = probability of class c_j, given that we have observed d
Bayes Classifier
A probabilistic framework for solving classification
problems
Conditional probability:

P(C | A) = P(A, C) / P(A)
P(A | C) = P(A, C) / P(C)

Bayes theorem:

P(C | A) = P(A | C)·P(C) / P(A)
Example of Bayes Theorem
Given:
1% of people have a certain genetic defect.
90% of tests for the gene detect the defect.
9.6% of the test are false positives.
Question:
If a person gets a positive test results, what are the odds they
actually have the genetic defect?
Let G = person has the genetic defect, X = positive test result.

P(G) = 1%; P(~G) = 99%
P(X | G) = 90%; P(X | ~G) = 9.6%

P(G | X) = P(X | G)·P(G) / P(X)
         = (0.9 × 0.01) / (0.9 × 0.01 + 0.096 × 0.99)
         ≈ 8.65%
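The arithmetic above in code:

```python
p_g, p_not_g = 0.01, 0.99                    # prior on the genetic defect
p_x_given_g, p_x_given_not_g = 0.90, 0.096   # detection rate, false-positive rate
p_x = p_x_given_g * p_g + p_x_given_not_g * p_not_g   # total probability of X
p_g_given_x = p_x_given_g * p_g / p_x        # Bayes theorem
print(round(100 * p_g_given_x, 2))           # ≈ 8.65 (percent)
```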
Bayesian Classifiers
Consider each attribute and class label as random
variables
Given a record with attributes (A1, A2,…,An)
Goal is to predict class C
Specifically, we want to find the value of C that maximizes P(C|
A1, A2,…,An )
Can we estimate P(C| A1, A2,…,An ) directly from data?
Bayesian Classifier Approach
Compute the posterior probability P(C | A1, A2, …,
An) for all values of C using the Bayes theorem
Choose value of C that maximizes
P(C | A1, A2, …, An)
Equivalent to choosing value of C that maximizes
P(A1, A2, …, An|C) P(C)
How to estimate P(A1, A2, …, An | C )?
P(C | A1, A2, …, An) = P(A1, A2, …, An | C)·P(C) / P(A1, A2, …, An)
Naïve Bayes Classifier
A simplified assumption: attributes are conditionally
independent and each data sample has n attributes
No dependence relation between attributes
By Bayes theorem,
As P(X) is constant for all classes, assign X to the
class with maximum P(X|Ci)*P(Ci)
P(X | Ci) = ∏_{k=1}^{n} P(x_k | Ci)

P(Ci | X) = P(X | Ci)·P(Ci) / P(X)
Take Home Example
P(Refund=Yes|No) = 3/7
P(Refund=No|No) = 4/7
P(Refund=Yes|Yes) = 0
P(Refund=No|Yes) = 1
P(Marital Status=Single|No) = 2/7
P(Marital Status=Divorced|No)=1/7
P(Marital Status=Married|No) = 4/7
P(Marital Status=Single|Yes) = 2/7
P(Marital Status=Divorced|Yes)=1/7
P(Marital Status=Married|Yes) = 0
For taxable income:
If class=No: sample mean=110
sample variance=2975
If class=Yes: sample mean=90
sample variance=25
Naïve Bayes Classifier:

Given a test record: X = (Refund = No, Marital Status = Married, Income = 120K)

P(X | Class=No) = P(Refund=No | Class=No) × P(Married | Class=No) × P(Income=120K | Class=No)
               = 4/7 × 4/7 × 0.0072 = 0.0024

P(X | Class=Yes) = P(Refund=No | Class=Yes) × P(Married | Class=Yes) × P(Income=120K | Class=Yes)
                = 1 × 0 × 1.2×10⁻⁹ = 0

Since P(X | No)·P(No) > P(X | Yes)·P(Yes), it follows that P(No | X) > P(Yes | X)
=> Class = No
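The take-home computation can be replayed in code (a sketch; the continuous Income attribute uses a Gaussian with the sample mean and variance given above, and the 7/10 vs. 3/10 class priors come from the 10-record training data):

```python
import math

def gaussian(x, mean, var):
    """Class-conditional density for the continuous Income attribute."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Conditional probabilities estimated from the training data (slide values)
p_x_no = (4/7) * (4/7) * gaussian(120, 110, 2975)   # ≈ 0.0024
p_x_yes = 1.0 * 0.0 * gaussian(120, 90, 25)         # = 0
p_no, p_yes = 7/10, 3/10                            # class priors
predicted = "No" if p_x_no * p_no > p_x_yes * p_yes else "Yes"
print(predicted)   # No
```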
Naïve Bayesian Classifier: Comments
Advantages :
Easy to implement
Good results obtained in most of the cases
Disadvantages
Assumption: class conditional independence
Practically, dependencies exist among variables
E.g., in hospitals: a patient's profile (age, family history, etc.) and symptoms (fever, cough, etc.) are related to diseases (lung cancer, diabetes, etc.)
Dependencies among these cannot be modeled by the Naïve Bayesian Classifier
How to deal with these dependencies?
Bayesian Belief Networks
Bayesian Networks
A Bayesian belief network allows a subset of the variables to be conditionally independent
A graphical model of causal relationships
Represents dependency among the variables
Gives a specification of joint probability distribution
Bayesian Belief Network: An Example
[Network: FamilyHistory and Smoker are parents of LungCancer; Smoker is also a parent of Emphysema; LungCancer is a parent of PositiveXRay; LungCancer and Emphysema are parents of Dyspnea]

CPT for LungCancer:

       (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
LC     0.8      0.5       0.7       0.1
~LC    0.2      0.5       0.3       0.9
Bayesian Belief Networks
The conditional probability table for the variable
LungCancer:
Shows the conditional probability for each
possible combination of its parents
P(z_1, …, z_n) = ∏_{i=1}^{n} P(z_i | Parents(Z_i))
Constructing a Belief Network
Select a set of variables describing the application domain
Choose an ordering of variables
Start with empty network
Add variables to the network one by one based on the ordering
To add the i-th variable Xi:
Determine pa(Xi) among the variables already in the network (X1, …, Xi−1) such that
P(Xi | X1, …, Xi−1) = P(Xi | pa(Xi)) (this needs domain knowledge)
Draw an arc from each variable in pa(Xi) to Xi
Neural Networks
Neural Networks
Artificial neuron
Each input is multiplied by a weighting factor.
Output is 1 if sum of weighted inputs exceeds a threshold value;
0 otherwise
Network is programmed by adjusting weights using
feedback from examples
Framework of Neural Networks
X1  X2  X3  Y
1   0   0   0
1   0   1   1
1   1   0   1
1   1   1   1
0   0   1   0
0   1   0   0
0   1   1   1
0   0   0   0
Output Y is 1 if at least two of the three inputs are equal to 1.
Training Examples
[Perceptron diagram: input nodes X1, X2, X3 connect to the output node Y (a black box) with weights 0.3, 0.3, 0.3 and threshold t = 0.4]

Y = I(0.3·X1 + 0.3·X2 + 0.3·X3 − 0.4 > 0), where I(z) = 1 if z is true and 0 otherwise
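The diagrammed model translates directly into code (the weights 0.3 and the threshold 0.4 are the slide's values; the function name is illustrative):

```python
def perceptron(x1, x2, x3, w=(0.3, 0.3, 0.3), t=0.4):
    """Outputs 1 when the weighted input sum exceeds the threshold t."""
    return 1 if w[0] * x1 + w[1] * x2 + w[2] * x3 - t > 0 else 0

# Reproduces the truth table: Y = 1 iff at least two of the three inputs are 1
for x1 in (0, 1):
    for x2 in (0, 1):
        for x3 in (0, 1):
            assert perceptron(x1, x2, x3) == (1 if x1 + x2 + x3 >= 2 else 0)
```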
Neural Network Model
Model is an assembly of inter-connected nodes and weighted links
Output node sums up each of its input values according to the weights of its links
Compare the output node's value against some threshold t

Perceptron Model:

Y = I( Σ_i w_i·X_i − t )   or   Y = sign( Σ_i w_i·X_i − t )
General Structure
[Diagram: a feed-forward network with input nodes (input vector x_i), hidden nodes, and output nodes (output vector), connected by weights w_ij]

I_j = Σ_i w_ij·O_i + θ_j                  (net input to unit j)
O_j = 1 / (1 + e^{−I_j})                  (sigmoid activation)
Err_j = O_j(1 − O_j)(T_j − O_j)           (error at an output unit)
Err_j = O_j(1 − O_j) Σ_k Err_k·w_jk       (error at a hidden unit)
w_ij = w_ij + (l)·Err_j·O_i               (weight update, learning rate l)
θ_j = θ_j + (l)·Err_j                     (bias update)
Network Training
The ultimate objective of training
Obtain a set of weights that makes almost all the tuples in the
training data classified correctly
Steps
Initialize weights with random values
Feed the input tuples into the network one by one
For each unit
Compute the net input to the unit as a linear combination of all the
inputs to the unit
Compute the output value using the activation function
Compute the error
Update the weights and the bias
Example of Neural Networks
[Diagram: input units 1, 2, 3 take inputs x1, x2, x3; hidden units 4 and 5; output unit 6; weights w14, w15, w24, w25, w34, w35, w46, w56]
Take Home Example
Initial input, weight, and bias values:

x1  x2  x3  w14  w15   w24  w25  w34   w35  w46   w56   θ4    θ5   θ6
1   0   1   0.2  -0.3  0.4  0.1  -0.5  0.2  -0.3  -0.2  -0.4  0.2  0.1

The net input and output calculations:

Unit j  Net input I_j                                 Output O_j
4       1(0.2) + 0(0.4) + 1(−0.5) − 0.4 = −0.7        1/(1 + e^0.7) = 0.332
5       1(−0.3) + 0(0.1) + 1(0.2) + 0.2 = 0.1         1/(1 + e^−0.1) = 0.525
6       (−0.3)(0.332) + (−0.2)(0.525) + 0.1 = −0.105  1/(1 + e^0.105) = 0.474
Take Home Example
Calculation of the error at each node (target T6 = 1):

Unit j  Err_j
6       (0.474)(1 − 0.474)(1 − 0.474) = 0.1311
5       (0.525)(1 − 0.525)(0.1311)(−0.2) = −0.0065
4       (0.332)(1 − 0.332)(0.1311)(−0.3) = −0.0087

Calculation for weight and bias updating (learning rate l = 0.9):

Weight or bias  New value
w46             −0.3 + (0.9)(0.1311)(0.332) = −0.261
θ4              −0.4 + (0.9)(−0.0087) = −0.408
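The two slides of arithmetic can be replayed end-to-end (variable names mirror the slide; the activation is the sigmoid 1/(1 + e^{−x})):

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

x1, x2, x3 = 1, 0, 1
w14, w24, w34 = 0.2, 0.4, -0.5
w15, w25, w35 = -0.3, 0.1, 0.2
w46, w56 = -0.3, -0.2
th4, th5, th6 = -0.4, 0.2, 0.1

# Forward pass
o4 = sigmoid(x1 * w14 + x2 * w24 + x3 * w34 + th4)   # ≈ 0.332
o5 = sigmoid(x1 * w15 + x2 * w25 + x3 * w35 + th5)   # ≈ 0.525
o6 = sigmoid(o4 * w46 + o5 * w56 + th6)              # ≈ 0.474

# Backward pass (target T6 = 1)
err6 = o6 * (1 - o6) * (1 - o6)          # ≈ 0.1311
err5 = o5 * (1 - o5) * err6 * w56        # ≈ -0.0065
err4 = o4 * (1 - o4) * err6 * w46        # ≈ -0.0087

# Updates with learning rate l = 0.9
l = 0.9
w46_new = w46 + l * err6 * o4            # ≈ -0.261
th4_new = th4 + l * err4                 # ≈ -0.408
```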
Network Pruning
Terminating conditions
All weight changes Δwij in the previous epoch were so small as to be below a specified threshold
The percentage of samples misclassified is below threshold
A prespecified number of runs has expired
Network pruning
A fully connected network is hard to articulate
N input nodes, h hidden nodes, and m output nodes lead to h(m + N) weights
Pruning: Remove some of the links without affecting classification
accuracy of the network
Summary of Neural Networks
Advantages
Prediction accuracy is generally high
Robust, works when training examples contain errors
Fast evaluation of the learned target function
Criticism
Long training time
Difficult to understand the learned function (weights)
Not easy to incorporate domain knowledge
Convolutional Neural Network
(CNN)
Size of Training Samples?
[Fully connected network: a 100 × 50 × 3 image gives 15,000 inputs, feeding hidden layers of 1024 and 512 nodes and an output layer of 2 nodes]

We would need at least 10 × 15,728,640,000 images to train such a neural network well!
Intuition
Convolution: some patterns are small
No need to see the whole image
The same pattern appears in different regions
Combining patterns to interpret the image
Subsampling an image
A smaller image still has the patterns, with fewer parameters
CNN Overview
[Pipeline: Convolution → Max-Pooling → Convolution → Max-Pooling → (repeat) → Flatten → softmax → Label]
Examples
Sliding Window
Convolution Layer
Different Angle
Filters are able to capture patterns at different angles
Convolution Layer - Flatten
[Example: a 2 × 2 feature map with values (6, 8; 3, 4) is flattened into the vector (6, 8, 3, 4)]
Recurrent Neural Network
Overview of RNN
Designed for obtaining language models
Sequential information: all the inputs are dependent
Memory in the hidden layer
Xt: input at time step t
St: hidden state at time step t
Ot: output at time step t
U,V,W: parameters
Training a RNN Language Model
Get a big corpus of text which is a sequence of words
Feed into RNN-LM; compute output distribution for every step t
i.e. predict probability dist of every word, given words so far
Loss function on step t is cross-entropy between
Predicted probability distribution
The true next word
Average this to get overall loss for entire training set
Advantages/Disadvantages
RNN Advantages:
Can process any length input
Computation for step t can use information from many steps back
Model size doesn’t increase for longer input
Same weights applied on every timestep
RNN Disadvantages:
Recurrent computation is slow
In practice, difficult to access information from many steps back
Many Others
Other Classification Methods
K-nearest neighbor classifier
Case-based reasoning
Genetic algorithm
Rough set approach
Fuzzy set approaches
The k-Nearest Neighbor Algorithm
All instances correspond to points in the n-D space.
The nearest neighbors are defined in terms of Euclidean distance.
The target function could be discrete- or real- valued.
For discrete-valued, the k-NN returns the most
common value among the k training examples nearest
to xq.
[Diagram: a query point xq in 2-D space surrounded by + and − training examples; xq is labeled with the majority class among its k nearest neighbors]
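A minimal k-NN sketch for the discrete-valued case (illustrative names; Euclidean distance and a majority vote among the k nearest training examples):

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (point, label) pairs; returns the majority label
    among the k training points nearest to the query (Euclidean distance)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    nearest = sorted(train, key=lambda rec: dist(rec[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((1, 1), "+"), ((1, 2), "+"), ((5, 5), "-"), ((6, 5), "-"), ((5, 6), "-")]
print(knn_predict(train, (1.5, 1.5), k=3))   # +
```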
Discussion on the k-NN Algorithm
Distance-weighted nearest neighbor algorithm
Weight the contribution of each of the k neighbors according to
their distance to the query point xq
Giving greater weight to closer neighbors
Curse of dimensionality: distance between neighbors
could be dominated by irrelevant attributes.
To overcome it, elimination of the least relevant attributes.
Genetic Algorithm
Genetic Algorithm
Genetic Algorithms (GA) were introduced by Holland in 1975
GA is an iterative search technique to find approximate
answers in the search problems
It is inspired by evolutionary biology such as
inheritance, mutation, selection, and crossover
Operation
A GA operates on a population of candidate solutions called chromosomes
Each chromosome is composed of numerous genes and represents an encoding of the problem
Each chromosome is associated with a fitness value evaluated by the fitness function
This fitness value determines the goodness and the survival ability of the chromosome
Chromosome
Chromosomes could be:
Bit strings: (0101 ... 1100)
Real numbers: (0.02 -12.5 ... 0.0 102.3)
Permutations of element: (A1 A5 A9 ... A2 A16)
Lists of rules: (R1 R5 R6 ... R8)
Program elements (line256 …..)
other data structures/types
Algorithm
Initializing the population and evaluating its
corresponding fitness values
While (termination condition not met)
Produce newer generations:
A portion of the chromosomes is selected, according to their survival ability, to reproduce offspring with a probability consistent with their fitness values
The offspring are generated through crossover and mutation processes
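The loop above can be sketched on a toy problem, maximizing the number of 1-bits in a bit-string chromosome (all names and parameters are illustrative; selection here simply keeps the fitter half as parents rather than sampling proportionally to fitness):

```python
import random

def fitness(chrom):
    """Toy fitness: the number of 1-bits (the 'one-max' problem)."""
    return sum(chrom)

def crossover(a, b, rng):
    """Single-point crossover of two parent chromosomes."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(chrom, rng, rate=0.05):
    """Flip each gene independently with a small probability."""
    return [1 - g if rng.random() < rate else g for g in chrom]

def evolve(pop_size=20, length=16, generations=50, seed=0):
    rng = random.Random(seed)
    # Initialize the population and evaluate fitness as needed below
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]   # the fitter half survives to reproduce
        children = [mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
print(fitness(best))   # close to the optimum of 16
```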